Notes on useful functions or operations in Python.
Basic Calculations
# Exponentiation
Basic Manipulations
Lists
Characteristics
- Name a collection of values.
d = [a,b,c]
- Can contain values of any type, and can mix different types
- List of lists:
d2 = [[a,1],[b,2],[c,3]]
Subsetting lists
Zero-based indexing: start from 0, [start (inclusive):end (exclusive)]
- x[0]: returns the first element
- x[-1]: returns the last element
- x[-2]: returns the second-to-last element
- x[3:5]: returns the fourth and fifth elements. The element at index 5 is not selected, so the slice behaves like the half-open interval [start, end) mathematically
- x[:4]: returns all elements from the start through the fourth
- x[5:]: returns all elements from the sixth to the last
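The indexing rules above can be checked on a short list. This is a minimal sketch; the list contents are made up for illustration.

```python
# A small list of letters; the values are illustrative only.
x = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

first = x[0]          # first element
last = x[-1]          # last element
second_last = x[-2]   # second-to-last element
middle = x[3:5]       # indices 3 and 4; index 5 is excluded
head = x[:4]          # indices 0 through 3
tail = x[5:]          # index 5 through the end

print(first, last, second_last)  # a g f
print(middle)                    # ['d', 'e']
print(head)                      # ['a', 'b', 'c', 'd']
print(tail)                      # ['f', 'g']
```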
List Manipulation
Replace the indexed part of the list with the desired values
x[0:2] = [a, b]
Adding and removing elements
# adding
x = x + [a, b]  # append a and b to the end of the list
# removing
del x[2]  # remove the third element
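Put together, the three operations above might look like the following sketch. The starting values are placeholders chosen only so each step is easy to follow.

```python
x = [10, 20, 30, 40]

# Replace the first two elements in place via slice assignment.
x[0:2] = ['a', 'b']
after_replace = list(x)   # snapshot: ['a', 'b', 30, 40]

# Concatenation builds a new list and rebinds the name x to it.
x = x + ['c', 'd']
after_add = list(x)       # snapshot: ['a', 'b', 30, 40, 'c', 'd']

# del removes the element at index 2 (the value 30).
del x[2]

print(after_replace)
print(after_add)
print(x)                  # ['a', 'b', 40, 'c', 'd']
```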
*Important Note: when you create a new list, Python stores the list in memory and stores the "address" of that list in the variable. The variable does not contain the list elements themselves, but rather a reference to them. This difference is especially important when you try to copy a list:*
x = ['a', 'b', 'c']
y = x
y[1] = 'z'
Now if you print y, you will see the following output: ['a', 'z', 'c']. Interestingly, x has also changed to ['a', 'z', 'c'].
That is because when you assign x to y with an equals sign, you copy the reference, not the list elements themselves. Therefore, when you update an element of the list stored in memory, both x and y, whose references point to this same list, return the changed result.
If you want y to be a new list with the same values as x, use y = list(x) or y = x[:] to select all the elements explicitly. Now when you update the elements in y, x will not change.
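The difference between sharing a reference and making a copy can be observed directly with the `is` operator. A minimal sketch, with variable names (`alias`, `copy_1`, `copy_2`) invented for illustration:

```python
x = ['a', 'b', 'c']

alias = x          # second name for the same list object
copy_1 = list(x)   # new list holding the same elements
copy_2 = x[:]      # new list built from a full slice

alias[1] = 'z'     # the change is visible through x as well

print(x)            # ['a', 'z', 'c']
print(copy_1)       # ['a', 'b', 'c']
print(copy_2)       # ['a', 'b', 'c']
print(alias is x)   # True
print(copy_1 is x)  # False
```

Both `list(x)` and `x[:]` make shallow copies: if the list contains other lists, those inner lists are still shared, and `copy.deepcopy` is the usual tool when a fully independent copy is needed.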
Data Frames
Summing column-wise and row-wise
# column-wise: one total per column
temp.sum()
# row-wise: one total per row
temp.sum(axis = 'columns')
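As a small illustration of the two directions, the sketch below builds a toy `temp` data frame (the column names and numbers are made up) and sums it both ways.

```python
import pandas as pd

# Hypothetical data: two numeric columns, three rows.
temp = pd.DataFrame({'TMIN': [1, 2, 3], 'TMAX': [10, 20, 30]})

# Default axis: collapse the rows, giving one total per column.
col_totals = temp.sum()

# axis='columns': collapse the columns, giving one total per row.
row_totals = temp.sum(axis='columns')

print(col_totals)
print(row_totals)
```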
Exploring Dataset
Initial Examination
# Import the pandas library as pd
Merging Data Frames
apple_high = pd.merge(
Converting Data Types
a = float(b) # convert into float
When a column has only a few distinct values, converting it to a categorical data type is more memory-efficient. It also allows you to specify a logical order for the categories.
ri.stop_length.unique()
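One way the categorical conversion could look is sketched below. The data frame and the three category labels are assumptions made up for this example; only the column name `stop_length` comes from the notes.

```python
import pandas as pd

# Hypothetical data: a long column with only three distinct string values.
ri = pd.DataFrame({'stop_length': ['short', 'long', 'medium', 'short'] * 1000})

before = ri.stop_length.memory_usage(deep=True)

# Declare the categories in their logical order (assumed labels).
cats = pd.CategoricalDtype(['short', 'medium', 'long'], ordered=True)
ri['stop_length'] = ri.stop_length.astype(cats)

after = ri.stop_length.memory_usage(deep=True)

# An ordered categorical supports comparisons against a category label.
n_longer = (ri.stop_length > 'short').sum()

print(before, after)   # the categorical column uses less memory
print(n_longer)        # 2000
```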
Data Cleaning
Dealing with NAs
# Count the number of missing values in each column
Dealing with Data Types
Changing data types
# examine the data types of all columns
data.dtypes
# change the data type of one column
data['col_a'] = data.col_a.astype('bool')
## datetime, categorical
Date time index
# Concatenate date and time columns
combined = data.date.str.cat(data.time, sep = ' ')
# Convert 'combined' to datetime format
data['datetime'] = pd.to_datetime(combined)
# Setting datetime as index
data.set_index('datetime', inplace = True)
# Examine the index
data.index
# Reset the index back to a column
data.reset_index(inplace = True)
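To see what the datetime index buys you, here is a minimal end-to-end sketch on a toy frame. The rows are invented; the column names `date`, `time`, and `num_people` mirror the ones used in these notes.

```python
import pandas as pd

# Hypothetical data with date and time stored as separate strings.
data = pd.DataFrame({
    'date': ['2020-01-05', '2020-01-20', '2020-02-10'],
    'time': ['08:30', '17:45', '12:00'],
    'num_people': [3, 5, 4],
})

# Concatenate the two string columns, then parse them as datetimes.
combined = data.date.str.cat(data.time, sep=' ')
data['datetime'] = pd.to_datetime(combined)
data = data.set_index('datetime')

# A DatetimeIndex exposes date parts and allows partial-string selection.
months = data.index.month
january = data.loc['2020-01']

# reset_index moves the index back into an ordinary column.
restored = data.reset_index()

print(list(months))    # [1, 1, 2]
print(len(january))    # 2
```

This version reassigns the result of `set_index` instead of passing `inplace = True`; both forms leave `data` indexed by the new column.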
Data Manipulation
Slicing Columns and Rows
# select column
Aggregation
Using Groupby
# groupby
a = data.num_people.mean()
a_by_month = data.groupby(data.index.month).num_people.mean()
Or use multiple variables as groupby criteria to create a multi-indexed series
search_rate = ri.groupby(['violation', 'driver_gender']).search_conducted.mean()
# the resulting multi-indexed series is very similar to a data frame.
# we can use the loc accessor to slice it
search_rate.loc['Equipment', 'M']
# we can also convert it to a data frame by unstacking it
search_rate.unstack()
Using Pivot Table
To save the trouble of grouping and then unstacking, we can use pivot_table directly to achieve the same result.
# pivot_table uses mean as its default aggregation method
ri.pivot_table(
    index = 'violation',
    columns = 'driver_gender',
    values = 'search_conducted')
Using Resampling based on date time
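The resulting groups are labeled with the last day of each month, rather than just 1, 2, and 3 as in the groupby version.
a = data.num_people.resample('M').mean()
# 'M' indicates month-end frequency, 'A' indicates year-end (annual) frequency
# (pandas 2.2 and later prefer the spellings 'ME' and 'YE')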
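The aggregation routes in this section can be compared side by side on toy data. The sketch below is illustrative only: the rows are invented, and it passes `pd.offsets.MonthEnd()` to `resample` instead of a frequency string, a substitution made here so the same code runs regardless of which alias a given pandas version accepts.

```python
import pandas as pd

# Hypothetical traffic-stop records.
ri = pd.DataFrame({
    'violation': ['Speeding', 'Speeding', 'Equipment',
                  'Equipment', 'Speeding', 'Equipment'],
    'driver_gender': ['M', 'F', 'M', 'F', 'M', 'M'],
    'search_conducted': [False, False, True, False, True, True],
})

# Route 1: group by two keys, then unstack the inner level into columns.
by_group = (ri.groupby(['violation', 'driver_gender'])
              .search_conducted.mean()
              .unstack())

# Route 2: pivot_table, whose default aggregation is the mean.
pivoted = ri.pivot_table(index='violation',
                         columns='driver_gender',
                         values='search_conducted')

# Hypothetical time series indexed by datetime.
data = pd.DataFrame(
    {'num_people': [2, 4, 6, 8]},
    index=pd.to_datetime(['2020-01-05', '2020-01-20',
                          '2020-02-10', '2020-03-15']),
)

# groupby on the month number labels the groups 1, 2, 3.
by_month_number = data.groupby(data.index.month).num_people.mean()

# resample labels each group with the last day of its month.
by_month_end = data.num_people.resample(pd.offsets.MonthEnd()).mean()

print(by_group)
print(pivoted)
print(by_month_number)
print(by_month_end)
```

One practical difference: grouping by month number merges the same month from different years into one group, while resampling keeps each calendar month separate.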
Mapping
A dictionary maps the values you have to the values you want.
# mapping up to True, and down to False
mapping = {'up':True, 'down':False} # before:after
apple['is_up'] = apple.change.map(mapping)
# the mean of a Boolean column is the proportion of True values, i.e. the share of "up"
apple.is_up.mean()
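A runnable version of the mapping step, on a four-row toy frame (the values are made up), might look like this:

```python
import pandas as pd

# Hypothetical data: direction of the daily price change.
apple = pd.DataFrame({'change': ['up', 'down', 'up', 'up']})

# Keys are the existing values, values are the replacements.
mapping = {'up': True, 'down': False}
apple['is_up'] = apple.change.map(mapping)

# The mean of a Boolean column is the share of True values.
share_up = apple.is_up.mean()

print(apple)
print(share_up)  # 0.75
```

Any value that does not appear as a key in the dictionary becomes NaN after `map`, so it is worth checking the column's unique values first.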
Basic Plots
from matplotlib import pyplot as plt
Styling Graphs
Change overall styles:
plt.style.use("fivethirtyeight") # change a set of styles, including the background, legends ...
- Overall Styles: fivethirtyeight, ggplot, seaborn (named seaborn-v0_8 in recent matplotlib versions), default, etc.
Change some specific styles:
plt.plot(
    x_values,
    y_values,
    color = "tomato", # named colors are listed under "Web colors" on Wikipedia
    linewidth = 1, # helps to emphasize a certain line
    linestyle = "--", # line type
    marker = "x" # marker drawn at each data point
)
Linestyle:
- "--": dashed line
- "-.": dot/dash line
- ":": dotted line
Marker:
- "x": cross
- "s": solid square
- "o": solid circle
- "d": thin diamond
- "*": star
- "h": solid hexagon
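The styling options above can be tried out with a few made-up points. This is a sketch under two assumptions: it selects the non-interactive "Agg" backend so it runs without a display, and it picks the "ggplot" style arbitrarily from the list of overall styles.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, so draw off-screen
from matplotlib import pyplot as plt

# Hypothetical data points.
x_values = [1, 2, 3, 4]
y_values = [2, 4, 3, 5]

# Names of the styles shipped with the installed matplotlib.
available = plt.style.available

plt.style.use("ggplot")  # overall style
plt.figure()

# plt.plot returns a list of Line2D objects, one per plotted line.
lines = plt.plot(
    x_values,
    y_values,
    color="tomato",
    linewidth=1,
    linestyle="--",  # dashed
    marker="x",      # cross at each data point
)
line = lines[0]

print(line.get_linestyle(), line.get_marker(), line.get_color())
```

Printing `plt.style.available` is a quick way to check which style names the installed version accepts.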
Adding text to plots
Put these calls before plt.show()
# axis labels and title
plt.xlabel("x axis title")
plt.ylabel("y axis title")
plt.title("plot title", fontsize = 20, color = 'green') # change text fontsize, color
# legend
plt.plot(x_values, y_values, label = "A") #Label keyword argument, which will show in our legend
plt.plot(x_values, y_values, label = "B")
plt.plot(x_values, y_values, label = "C")
plt.legend() # tells matplotlib to show legends
# text to a certain point
plt.text(xcoord, ycoord, "Text Message")
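As a quick check of how these calls fit together, the sketch below draws two labeled lines on made-up data and then reads the text elements back from the axes. The "Agg" backend is an assumption so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, so draw off-screen
from matplotlib import pyplot as plt

# Hypothetical data.
x_values = [1, 2, 3]

plt.figure()
plt.plot(x_values, [1, 2, 3], label="A")  # label appears in the legend
plt.plot(x_values, [2, 3, 4], label="B")

plt.xlabel("x axis title")
plt.ylabel("y axis title")
plt.title("plot title", fontsize=20, color="green")
plt.legend()                     # draws the legend from the labels
plt.text(2, 3, "Text Message")   # text placed at data coordinates (2, 3)

# Inspect what was added to the current axes.
ax = plt.gca()
legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
print(ax.get_title(), legend_labels)
```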
Making Comparisons
Line graph
Line graphs usually deal with time-related comparisons: we want to compare the trends of two variables across time.
We can compare two variables in one plot after applying the same aggregation to both of them.
# calculate the aggregation for both variables
monthly_price = apple.price.resample('M').mean()
monthly_volume = apple.volume.resample('M').mean()
# concat the two variables side by side
monthly = pd.concat([monthly_price, monthly_volume], axis = 'columns')
# plot two variables
## specifying to put the two variables into two subplots since two variables often have different scales
monthly.plot(subplots = True)
plt.show()
Bar chart
If we only have one categorical variable, we can simply plot it with a bar chart and sort the bars
# output a series with the categories in the index and the calculated values (sorted alphabetically by category)
search_rate = ri.groupby('violation').search_conducted.mean()
# Sort the series by its values (ascending by default) before plotting
search_rate.sort_values().plot(kind = 'bar')
# a horizontal bar chart makes the category labels easier to read
search_rate.sort_values().plot(kind = 'barh')
If we have two categorical variables, we may want to compare the counts of different combinations of categories.
table.plot(kind = 'bar')
#stacked bar chart
table.plot(kind = 'bar', stacked = True)
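The notes do not show how `table` is built. One plausible way, assumed here, is `pd.crosstab`, which counts every combination of two categorical columns. The data below is invented, and the "Agg" backend is chosen so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, so draw off-screen
import pandas as pd

# Hypothetical traffic-stop records.
ri = pd.DataFrame({
    'violation': ['Speeding', 'Speeding', 'Equipment', 'Equipment', 'Speeding'],
    'driver_gender': ['M', 'F', 'M', 'M', 'M'],
})

# Frequency table: rows are violations, columns are genders (assumed construction).
table = pd.crosstab(ri.violation, ri.driver_gender)

# One bar per violation, with the gender counts stacked on top of each other.
ax = table.plot(kind='bar', stacked=True)

print(table)
```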
Box Plot
weather[['TMIN', 'TAVG', 'TMAX']].plot(kind = 'box')
Strings
Regular Expression
. represents any 1 character (except a newline by default), while * repeats the preceding element zero or more times
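In Python's `re` module, `.` stands for exactly one character, while `*` is a quantifier applied to whatever precedes it. The sketch below illustrates both on made-up strings.

```python
import re

# "." stands for exactly one character (any character except a newline).
one_char = re.findall(r"c.t", "cat cot ct coat")

# "*" repeats the preceding element zero or more times.
zero_or_more = re.findall(r"co*t", "ct cot coot cat")

# Combined, ".*" matches any run of characters, including an empty one.
any_run = re.fullmatch(r"c.*t", "coat") is not None

print(one_char)      # ['cat', 'cot']
print(zero_or_more)  # ['ct', 'cot', 'coot']
print(any_run)       # True
```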